The examples and datasets in this Lab session follow very closely two sources:
…
Topics discussed in Lecture # 4
Lecture 4: topics
….
R ENVIRONMENT SET UP & DATA
Needed R Packages
We will use functions from packages base, utils, and stats (pre-installed and pre-loaded)
We may also use the packages below (specifying package::function for clarity).
# Load pckgs for this R session# --- General library(here) # A Simpler Way to Find Your Files library(skimr) # Compact and Flexible Summaries of Datalibrary(paint) # Fancy structure for data frames library(janitor) # Simple Tools for Examining and Cleaning Dirty Datalibrary(tidyverse) # Easily Install and Load the 'Tidyverse'# --- Plotting & data visualizationlibrary(ggfortify) # Data Visualization Tools for Statistical Analysis Resultslibrary(ggpubr) # 'ggplot2' Based Publication Ready Plotslibrary(dagitty) # Graphical Analysis of Structural Causal Modelslibrary(ggdag) # Analyze and Create Elegant Directed Acyclic Graphs #library(patchwork) # The Composer of Plots# --- Descriptive statistics and regressionslibrary(broom) # Convert Statistical Objects into Tidy Tibbleslibrary(Hmisc) # Harrell Miscellaneouslibrary(estimatr) # Fast Estimators for Design-Based Inferencelibrary(modelsummary) # Summary Tables and Plots for Statistical Models and Data:library(sandwich) #for robust variance estimationlibrary(survey) # Analysis of Complex Survey Sampleslibrary(haven) # Import and Export 'SPSS', 'Stata' and 'SAS' Fileslibrary(simstudy) # Simulation of Study Datalibrary(NHANES) # Data from the US National Health and Nutrition Examination Study
—
ITE
Individual Treatment Effect (ITE) is defined as the difference in potential outcomes for a given individual \(ITE_i= y_{i}^1 − y_{i}^0\)
The Causal Effect (CE)
\(CE_i = Y_{1i} - Y_{0i}\) where \(Y_{1i}\) and \(Y_{0i}\) are the potential outcomes for each individual \(i\) being both exposed and not exposed respectively (not possible in the real world).
Simulation
def <- simstudy::defData(varname ="C", formula =0.4, dist ="binary")def <- simstudy::defData(def, "X", formula ="0.3 + 0.4 * C", dist ="binary")def <- simstudy::defData(def, "e", formula =0, variance =2, dist ="normal")def <- simstudy::defData(def, "Y0", formula ="2 * C + e", dist="nonrandom")def <- simstudy::defData(def, "Y1", formula ="1 + 2 * C + e", dist="nonrandom")def <- simstudy::defData(def, "Y_obs", formula ="Y0 + (Y1 - Y0) * X", dist ="nonrandom") # = Y1*X + (1-X)*Y0set.seed(2017)dt <-genData(10000, def)
paste0("Calculated mean causal effect = ", mean(dt[, Y1] - dt[, Y0])) #mean difference of counterfactual outcomes
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.481 0.0225 21.3 8.84e-99
2 X 1.71 0.0335 51.1 0
DAG with COLLIDER
# df DAG dag_collid <-dagify( Y ~ X + e, Z ~ X + Y, exposure ="X",outcome ="Y", #latent = "e",# labelslabels =c("Z"="Collider", "X"="Exposure","Y"="Outcome", "e"="Unobserved \nerror"),# positionscoords =list(x =c(X =0, Y=4, Z =2 , e =5),y =c(X =0, Y =0, Z =2, e =1) )) %>%tidy_dagitty() # Plot DAG dag_col_p <- dag_collid %>%ggdag(., layout ="circle") +# decided in dagifytheme_dag_blank(plot.caption =element_text(hjust =1)) +# Nodes geom_dag_node(aes(color = name)) +# Labels dodged to avoid overlapgeom_dag_label(aes(label = label), color ="#4c4c4c", nudge_x =0.7, nudge_y =0.2) +geom_dag_text(color="white") +labs(title ="Causal map with COLLIDER (Z)" , #subtitle = "X = exposure, Y = outcome, Z = collider, e = random error", caption = ) +# Map colors to specific nodesscale_color_manual(values =c("X"="#9b6723", "Y"="#72aed8", "Z"="#d02e4c","e"="#A6A6A6"), guide ="none") # Remove legenddag_col_p
DAG with COLLIDER
DAG with CONFOUNDER
library(dagitty) # Graphical Analysis of Structural Causal Models # Graphical Analysis of Structural Causal Models # Graphical Analysis of Structural Causal Models library(ggdag) # Analyze and Create Elegant Directed Acyclic Graphs # Analyze and Create Elegant Directed Acyclic Graphs # Analyze and Create Elegant Directed Acyclic Graphs # df DAG dag_confounder <-dagify(Y ~ X + Z + e, X ~ Z, exposure ="X", outcome ="Y", #latent = "e",# labelslabels =c("Z"="Confounder","X"="Exposure","Y"="Outcome","e"="Unobserved \nerror"),# positionscoords =list(x =c(X =0, Y=4, Z =2 , e =5),y =c(X =0, Y =0, Z =2, e =1) )) %>%tidy_dagitty() # Add edge_type within pipe# Plot DAG dag_conf_p <- dag_confounder %>%ggdag(., layout ="circle") +theme_dag_blank(plot.caption =element_text(hjust =1)) +# Nodes geom_dag_node(aes(color = name)) +# Labels dodged to avoid overlapgeom_dag_label(aes(label = label), color ="#4c4c4c", nudge_x =0.7, nudge_y =0.2) +geom_dag_text(color="white") +labs(title ="Causal map with CONFOUNDER (Z)" , #subtitle = " ",caption = ) +# Map colors to specific nodesscale_color_manual(values =c("X"="#9b6723", "Y"="#72aed8", "Z"="#d02e4c","e"="#A6A6A6"), guide ="none") dag_conf_p
DAG with CONFOUNDER
DAG with MEDIATOR
library(dagitty) # Graphical Analysis of Structural Causal Models # Graphical Analysis of Structural Causal Models # Graphical Analysis of Structural Causal Models library(ggdag) # Analyze and Create Elegant Directed Acyclic Graphs # Analyze and Create Elegant Directed Acyclic Graphs # Analyze and Create Elegant Directed Acyclic Graphs # df DAG dag_med <-dagify( Y ~ Z + e, Z ~ X, exposure ="X", outcome ="Y", #latent = "e",# labelslabels =c("Z"="Mediator","X"="Exposure","Y"="Outcome","e"="Unobserved \nerror"),# positionscoords =list(x =c(X =0, Y=4, Z =2 , e =5),y =c(X =0, Y =0, Z =2, e =1) )) %>%tidy_dagitty() # Add edge_type within pipe# Plot DAG dag_med_p <- dag_med %>%ggdag(., layout ="circle") +# decided in dagifytheme_dag_blank(plot.caption =element_text(hjust =1)) +# Nodes geom_dag_node(aes(color = name)) +# Labels dodged to avoid overlapgeom_dag_label(aes(label = label), color ="#4c4c4c", nudge_x =0.7, nudge_y =0.2) +geom_dag_text(color="white") +labs(title ="Causal map with MEDIATOR (Z)" , #subtitle = "X = exposure, Y = outcome, Z = collider, e = random error", caption = ) +# Map colors to specific nodesscale_color_manual(values =c("X"="#9b6723", "Y"="#72aed8", "Z"="#d02e4c","e"="#A6A6A6"), guide ="none") # Remove legenddag_med_p
DAG with MEDIATOR
—-
DATASETS for today
Importing Dataset 1 (NHANES)
Name: NHANES, accessible from package NHANESDocumentation: See reference on the data downloaded and …. Sampling details: We’ll use a subset of the NHANES dataset, focusing on variables related to smoking status (SmokeNow), systolic blood pressure (BPSysAve), Body Mass Index (BMI) and age (Age).
set.seed(123) # Set seed for reproducibilitydata(NHANES)# Select relevant columns and drop rows with missing valuesnhanes_data <- NHANES %>%select(ID, Gender, Age, Race1, Height, Weight, BMI, SmokeNow, PhysActive, PhysActiveDays, BPSysAve, BPSysAve, BPDiaAve, TotChol, DirectChol, Diabetes, HealthGen, Education, HHIncomeMid, Poverty) %>%drop_na() %>%# make it smaller slice_sample(n =1000) # Randomly select 1000 rows using# Take a quick look at the data#paint(nhanes_data)
Dataset 1 (NHANES) Variables and their description
EXCERPT: see complete file in Input Data Folder
Variable
Type
Description
X
int
xxxx
ID
int
xxxxx
SurveyYr
chr
yyyy_mm. Ex. 2011_12
Gender
chr
Gender (sex) of study participant coded as male or female
Age
int
##
AgeDecade
chr
yy-yy es 20-29
Race1
chr
Reported race of study participant: Mexican, Hispanic, White, #Black, or Other
Education
chr
[>= 20 yro]. Ex. 8thGrade, 9-11thGrade, HighSchool, SomeCollege, or CollegeGrad.
HHIncome
chr
Total annual gross income for the household in US dollars
HHIncomeMid
int
Numerical version of HHIncome derived from the middle income # in each category. Ex. 12500 40000
Poverty
dbl
A ratio of family income to poverty guidelines. Smaller # numbers indicate more poverty Ex.. 0.95 1.74 4.99
Weight
dbl
Weight in kg
Height
dbl
Standing height in cm. Reported for participants aged 2 years or older.
BMI
dbl
Body mass index (weight/height2 in kg/m2). Reported for participants aged 2 years or older
Pulse
int
60 second pulse rate
BPSysAve
int
Combined systolic blood pressure reading, following the # procedure outlined for BPXSAR
BPDiaAve
int
Combined diastolic blood pressure reading, following the # procedure outlined for BPXDAR
DirectChol
dbl
Direct HDL cholesterol in mmol/L. Reported for participants aged 6 years or older
TotChol
dbl
Total HDL cholesterol in mmol/L. Reported for participants aged 6 years or older
Diabetes
chr
Study participant told by a doctor or health professional that they have diabetes
HealthGen
chr
Self-reported rating of health: Excellent, Vgood, Good, Fair, or Poor Fair
PhysActive
chr
Participant does moderate or vigorous-intensity sports, fitness or recreational activities (Yes or No).
SmokeNow
chr
Study participant currently smokes cigarettes regularly. (Yes or No)
...
...
...
CAUSAL MODELING WITH CONFOUNDER
Example of CONFOUNDER
Z = Age = confounder for (X = SmokeNow) and blood pressure (Y = BPSysAve)
data(NHANES)# Select relevant columns and drop rows with missing valuesnhanes_conf <- NHANES %>%select(ID, Age, SmokeNow, BPSysAve) %>%drop_na()# Take a quick look at the datapaint(nhanes_conf)
tibble [3108, 4]
ID int 51624 51624 51624 51630 51654 51666
Age int 34 34 34 49 66 58
SmokeNow fct No No No Yes No Yes
BPSysAve int 113 113 113 112 111 127
In this context:
Age influences both smoking (SmokeNow) and blood pressure (BPSysAve), making it a confounder.
SmokeNow directly affects BPSysAve.
Visualisation of CONFOUNDER
# linear rel qplot (Age, BPSysAve, data = nhanes_conf, color = SmokeNow) +geom_smooth(method ="lm", se =FALSE) +labs(title ="Scatterplot of Age and Blood Pressure by Smoking Status",x ="Age",y ="Blood Pressure (mmHg)",color ="Smoking Status")
Visualisation of CONFOUNDER
Regression analysis - controlling for the confounder
Unadjusted model:
# Unadjusted linear model model_unadjusted <-lm(BPSysAve ~ SmokeNow, data = nhanes_conf)summary(model_unadjusted)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 124.38 0.432 287.97 0.0e+00
SmokeNowYes -4.26 0.643 -6.64 3.8e-11
Estimate Std. Error t value Pr(>|t|)
(Intercept) 100.997 1.0886 92.78 0.00e+00
SmokeNowYes 0.684 0.6313 1.08 2.79e-01
Age 0.432 0.0187 23.09 5.19e-109
Compare models (without and with confounder)
Adjusting for Age changes the estimated effect of SmokeNow on blood pressure (BPSysAve)
Unadjusted Model: This model ‘does not control for age’, so any observed relationship between smoking and blood pressure might be confounded.
Adjusted Model: This model ‘controls for age’, providing a more accurate estimate of the causal effect of smoking on blood pressure by accounting for the confounding influence of age.
Adjusting for Age changes the estimated effect of
SmokeNow on blood pressure (BPSysAve)
NO Confounder
Confounder
* p
(Intercept)
124.384**
100.997**
(0.432)
(1.089)
SmokeNowYes
-4.264**
0.684
(0.643)
(0.631)
Age
0.432**
(0.019)
Num.Obs.
3108
3108
R2
0.014
0.158
Compare predicted values from 2 models
With tidyr::pivot_longer() we will reshape the dataset so we can plot the predicted values from both models in a faceted plot.
# Add predicted values from the unadjusted and adjusted modelsnhanes_conf <- nhanes_conf %>%mutate(pred_unadjusted =predict(model_unadjusted), # Predicted BPSysAve from unadjusted modelpred_adjusted =predict(model_adjusted) # Predicted BPSysAve from adjusted model )# Reshape the data into long format using pivot_longer()nhanes_long <- nhanes_conf %>%pivot_longer(cols =c(pred_unadjusted, pred_adjusted), names_to ="model", values_to ="predicted") %>%mutate(model =recode(model, pred_unadjusted ="Unadjusted: SmokeNow on BPSysAve", pred_adjusted ="Adjusted: SmokeNow + Age on BPSysAve"))
Plot predicted values from 2 models
# Violin plot with dashed line connecting the two conditionsggplot(nhanes_long, aes(x =factor(SmokeNow), y = BPSysAve, fill = model)) +# Create the violin plot to show the distribution of blood pressuregeom_point(color ="grey", alpha =0.4, position =position_dodge(width =0.3) ) +# Overlay the predicted dashed line between the two smoking conditionsgeom_line(aes(y = predicted, group = model, color = model), size =0.8, linetype ="dashed", position =position_dodge(width =0.3)) +# Facet vertically for unadjusted and adjusted modelsfacet_wrap(model ~ ., scales ="free_y",ncol =2) +# Add labels and titlelabs(title ="Confounder Analysis: Predicted values in Unadjusted vs Adjusted Regression models",subtitle ="Age = Confounder",x ="Smoking Status",y ="Systolic Blood Pressure (BPSysAve)",color ="Model",fill ="Model" ) +# Customize colors for better contrastscale_fill_manual(values =c("Unadjusted: SmokeNow on BPSysAve"="lightcoral" , "Adjusted: SmokeNow + Age on BPSysAve"="lightblue")) +scale_color_manual(values =c("Unadjusted: SmokeNow on BPSysAve"="red", "Adjusted: SmokeNow + Age on BPSysAve"="blue")) +# Minimal theme for claritytheme_minimal()+theme(legend.position ="none")
Plot predicted values from 2 models
CAUSAL MODELING WITH MEDIATOR
The case of a mediator
M = BMI is a mediator for the effect of X = PhysActiveDays on Y = BPSysAve
data(NHANES)# Select relevant columns and drop rows with missing valuesnhanes_med <- NHANES %>%select(Gender, Age, BPSysAve, BMI, PhysActiveDays, Diabetes, SmokeNow) %>%drop_na()# Take a quick look at the datasummary(nhanes_med)
Gender Age BPSysAve BMI PhysActiveDays
female:632 Min. :20.0 Min. : 81 Min. :15.0 Min. :1.0
male :831 1st Qu.:34.0 1st Qu.:111 1st Qu.:23.6 1st Qu.:2.0
Median :47.0 Median :119 Median :27.0 Median :3.0
Mean :47.8 Mean :122 Mean :27.9 Mean :3.6
3rd Qu.:60.0 3rd Qu.:131 3rd Qu.:31.3 3rd Qu.:5.0
Max. :80.0 Max. :221 Max. :59.1 Max. :7.0
Diabetes SmokeNow
No :1312 No :834
Yes: 151 Yes:629
Fitting unadjusted (total) model
Before adjusting for any mediator, we fit a simple linear model to estimate the total effect of PhysActiveDays on Systolic Blood Pressure (disregarding any possible mediation of BMI).
Total Effect Model (Unadjusted)
This model gives the total effect of of PhysActiveDays on Systolic Blood Pressure (including any indirect effects through M = BMI or other variables that haven’t been included.)
# Unadjusted model: total effect of smoking on blood pressuretotal_effect_model <-lm(BPSysAve ~ PhysActiveDays, data = nhanes_med)summary(total_effect_model)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 120.12 1.055 113.88 0.0000
PhysActiveDays 0.59 0.262 2.25 0.0246
Set up the mediation models 1/2
Then, we fit the model to estimate the effect of X = PhysActiveDays on M = BMI:
Mediator model
This model allows us to see whether more physically active days are associated with (lower) BMI values (which could then affect blood pressure).
Calculate the Indirect, Direct, and Total Effects (1/2)
Now that we have the coefficients from the models, we can calculate the mediation effects manually.
Direct Effect: This is the coefficient of PhysActiveDays in the outcome model, which gives the direct effect of exercise on blood pressure (controlling for BMI = M)
Indirect Effect: This is the product of the coefficient from the mediator model (PhysActiveDays → BMI) and the coefficient from the outcome model (BMI → BPSysAve).
Total Effect: This is the coefficient of PhysActiveDays in the unadjusted model (NOT controlling for BMI = M).
Calculate the Indirect, Direct, and Total Effects (2/2)
Breakdown of the estimated effects:
# TOTAL effect: The effect of PhysActiveDays on BPSysAve without adjusting for BMItotal_effect <-coef(total_effect_model)["PhysActiveDays"]total_effect # 0.59
Calculate the Indirect, Direct, and Total Effects (2/2)
PhysActiveDays
0.59
# DIRECT effect: The effect of PhysActiveDays on BPSysAve after adjusting for BMI.direct_effect <-coef(outcome_model)["PhysActiveDays"] # Direct effect (controlling for BMI)direct_effect # 0.621
PhysActiveDays
0.621
# Indirect effect: The portion of the effect on BPSysAve that is mediated through BMIindirect_effect <-coef(mediator_model)["PhysActiveDays"] *coef(outcome_model)["BMI"] indirect_effect # [-0.0821 X 0.382] = -0.0313
PhysActiveDays
-0.0313
# or total_effect - direct_effect # = -0.0313
PhysActiveDays
-0.0313
Proportion mediated
# Proportion mediated: The proportion of the total effect that is mediated through BMIproportion_mediated <- glue::glue("{round(indirect_effect / total_effect *100, 2)}%")proportion_mediated # -0.0531
# Create a data frame for the effectseffects_df <-data.frame(Effect =c("Total Effect", "Direct Effect", "Indirect Effect"),Value =c(total_effect, direct_effect, indirect_effect))# Create the bar plotggplot(effects_df, aes(x = Effect, y = Value, fill = Effect)) +geom_bar(stat ="identity") +labs(title ="Mediation Analysis: Total, Direct, and Indirect Effects", subtitle ="Effect (coefficient) of Physical Activity on Systolic Blood Pressure (BPSysAve)",y ="", x ="") +theme_minimal()
Interpret results
The total effect tells us the overall relationship between PhysActiveDays and BPSysAve.
The direct effect represents the relationship between PhysActiveDays and BPSysAve, controlling forBMI.
The indirect effect (mediated effect) shows how much of the relationship between PhysActiveDays and BPSysAve is explained through BMI (AKA the coef of PhysActiveDays in the mediator model times the coef of BMI in the outcome model)
The proportion mediated indicates how much of the total effect is due to the mediation by BMI.
Calculate the Indirect, Direct, and Total Effects (modo2)
# Reshape the data into long format for facetingnhanes_long <- nhanes_med_pred %>%gather(key ="model", value ="predicted", pred_bps_adj, pred_bps_total) %>%mutate(model =recode(model, #pred_bmi = "Mediator Effect: BPSysAve on BMI", pred_bps_adj ="Adjusted Effect: PhysActiveDays + BMI on BPSysAve", pred_bps_total ="Total Effect: PhysActiveDays on BPSysAve"))
Plot results 2/2
# Plot with vertically aligned facetsggplot(nhanes_long, aes(x = BMI)) +geom_point(aes(y = BPSysAve), color ="black", shape =16, alpha =0.5) +# Actual data points for BPSysAvegeom_line(aes(y = predicted, color = model), size =1) +# Predicted values from different modelsfacet_grid(model ~ ., scales ="free_y") +# Facet verticallylabs(title ="Mediation Analysis: Total, Adjusted, and Mediator Effects",x ="BMI",y ="Outcome / Mediator",color ="Effect Type") +theme_minimal()
# Select relevant columns and drop rows with missing valuesdat_conf <- dat %>%select(id, age, bp_systolic, fv, exerc, bmi, smoker, gender, hh_income_usd, ln_hh_income_percap ) %>%drop_na() %>%filter(fv <=1500)# Take a quick look at the datasummary(dat_conf)Hmisc::describe(dat_conf$bp_systolic) # YHmisc::describe(dat_conf$bmi) # XHmisc::describe(dat_conf$age) # Z (confounder )Hmisc::describe(dat_conf$exerc) # Z (confounder )Hmisc::describe(dat_conf$ln_hh_income_percap) # Z (confounder )
In this context:
exerc influences both smoking (fv) and blood pressure (bp_systolic), making it a confounder (albeit as PROXY for socio-economic status)
fv directly affects bp_systolic
— Regression analysis
# Unadjusted modelmodel_unadjusted <-lm( bp_systolic ~ bmi + ln_hh_income_percap , data = dat_conf)summary(model_unadjusted)# Adjusted modelmodel_adjusted <-lm(bp_systolic ~ bmi + age + ln_hh_income_percap, data = dat_conf)summary(model_adjusted)
adjusting for Z changes the estimated effect of X on Y
Unadjusted Model: This model does not control for Z, which is a confounder of the relationship between X and Y.
Adjusted Model: This model controls for Z, providing a more accurate estimate of the causal effect of X on Y.
— Add Predicted Values to the Dataset
model_unadjusted$coefficients model_adjusted$coefficients# Get predicted values using augment and attach to the original datasetdat_unadjusted <- broom::augment(model_unadjusted, data = dat_conf) %>%select( -.resid, -.hat, -.sigma, -.cooksd, -.std.resid ) %>%rename(pred_unadjusted = .fitted)dat_adjusted <- broom::augment(model_adjusted, data = dat_conf) %>%select( -.resid, -.hat, -.sigma, -.cooksd, -.std.resid ) %>%rename(pred_adjusted = .fitted)# Combine the predicted values from both modelsdat_conf_long <- dat_unadjusted %>%left_join(dat_adjusted, by ="id", suffix=c("",".y"))%>%select(-ends_with(".y")) %>%# Convert the data from wide to long formatpivot_longer(cols =c(pred_unadjusted, pred_adjusted), names_to ="model", values_to ="predicted")# Check the structure of the long datasetstr(dat_conf_long)
— Plot results 1/2
Reshape the Data Using pivot_longer()
We will reshape the dataset so we can plot the predicted values from both models in a faceted plot.